Pac-Man Gameplay with Proximal Policy Optimization and Generalized Advantage Estimation¶

Part 2/2: Results & Evaluation¶

By: Ken K. Hong

5. Experimental Results and Analysis¶

The agent was trained for a total of 24.5 million timesteps across 30,000 episodes, requiring 14.4 hours in the ALE/Pacman-v5 environment using RAM-based observations. This section presents a detailed analysis of the agent's learning progression, qualitative evaluation of emergent behaviors, and interpretation of policy decisions linked to the underlying RAM state.

Table 2 summarizes model performance across training episodes on a Colab L4 GPU.

| Episodes | Time (min) | Cumulative Time (hr) | Cumulative Steps | Avg Shaped Reward (Last 100 eps) | Avg Unshaped Reward (Last 100 eps) | Eval Reward (100 eps) |
|---|---|---|---|---|---|---|
| 1–5000 | 126.27 | 2.10 | 3,590,291 | 66.60 | 138.40 | 23.93 |
| 5001–10000 | 131.67 | 4.30 | 7,359,393 | 197.66 | 273.05 | 344.82 |
| 10001–15000 | 144.32 | 6.70 | 11,482,844 | 286.32 | 368.79 | 409.53 |
| 15001–20000 | 155.22 | 9.29 | 15,879,247 | 337.27 | 425.20 | 381.54 |
| 20001–25000 | 157.55 | 11.92 | 20,276,052 | 377.02 | 464.96 | 518.19 |
| 25001–30000 | 151.42 | 14.44 | 24,511,479 | 411.19 | 495.90 | 506.12 |
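As a quick consistency check on Table 2, differencing the cumulative step counts gives the average episode length per phase. Episodes lengthen as the agent survives deeper into the game:

```python
# Cumulative step counts copied from Table 2 (leading 0 for the start of training)
cumulative_steps = [0, 3_590_291, 7_359_393, 11_482_844,
                    15_879_247, 20_276_052, 24_511_479]

# Steps taken within each 5,000-episode phase
phase_steps = [b - a for a, b in zip(cumulative_steps, cumulative_steps[1:])]
avg_episode_length = [s / 5000 for s in phase_steps]
# Roughly 718 steps/episode in Phase 1 versus roughly 847 in Phase 6,
# reflecting longer survival times as the policy improves.
```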

5.1 Training Progression and Learning Phases¶

The training process over 30,000 episodes can be divided into six phases, each characterized by distinct behavioral patterns, performance trends, and algorithmic dynamics, as shown in Figure 2.

Phase 1: Random Exploration (Episodes 1–5,000)¶

During this initial stage, the policy behaves nearly randomly, frequently resulting in collisions with ghosts and entrapment. This is reflected in low average rewards of 66.6 (shaped) and 138.4 (unshaped), with an evaluation reward of 23.93. High entropy regularization encourages broad exploration; however, no meaningful action-reward associations are yet established. This exploration phase is critical for later training, as it facilitates the discovery of actions that yield high rewards.
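The "high entropy regularization" referred to here is an entropy bonus added to the PPO objective; a minimal sketch, where the 0.01 coefficient and the 5-action logits are illustrative values rather than the training configuration used in this project:

```python
import torch
from torch.distributions import Categorical

def entropy_bonus(logits, ent_coef=0.01):
    """Mean policy entropy scaled by the entropy coefficient.

    Adding this term to the PPO objective penalizes near-deterministic
    policies, keeping exploration broad early in training.
    """
    dist = Categorical(logits=logits)
    return ent_coef * dist.entropy().mean()

# A uniform policy over 5 actions has maximal entropy ln(5) ~= 1.609;
# a sharply peaked policy has entropy near zero.
uniform_bonus = entropy_bonus(torch.zeros(1, 5))
peaked_bonus = entropy_bonus(torch.tensor([[10.0, 0.0, 0.0, 0.0, 0.0]]))
```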

Phase 2: Rapid Policy Acquisition (Episodes 5,001–10,000)¶

During this phase, the agent learns fundamental strategies including ghost avoidance and pellet collection. Average rewards increase sharply to 197.7 (shaped) and 273.1 (unshaped), while evaluation rewards reach 344.8. This period is marked by strong advantage estimates driving substantial policy improvements.
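The advantage estimates driving these updates come from Generalized Advantage Estimation (Schulman et al., 2015). A minimal sketch of the standard recursion, with illustrative default `gamma` and `lam` values:

```python
import numpy as np

def compute_gae(rewards, values, dones, gamma=0.99, lam=0.95):
    """Generalized Advantage Estimation over a single rollout.

    rewards, dones: length-T arrays; values: length T+1 array that
    includes the bootstrap value of the final state.
    """
    T = len(rewards)
    advantages = np.zeros(T)
    gae = 0.0
    for t in reversed(range(T)):
        nonterminal = 1.0 - dones[t]
        # One-step TD error at time t
        delta = rewards[t] + gamma * values[t + 1] * nonterminal - values[t]
        # Exponentially weighted sum of future TD errors
        gae = delta + gamma * lam * nonterminal * gae
        advantages[t] = gae
    return advantages
```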

Phase 3: Strategic Consolidation (Episodes 10,001–15,000)¶

Navigation becomes more structured and efficient as the agent clears board sections methodically. Average rewards increase to 286.3 (shaped) and 368.8 (unshaped), with evaluation reward stabilizing near 409.5. PPO updates are moderate, reinforcing effective sequential behaviors.

Phase 4: Advanced Adaptation (Episodes 15,001–20,000)¶

The agent adopts higher-level tactics such as strategic power pellet use. Average rewards rise to 337.3 (shaped) and 425.2 (unshaped), while evaluation reward decreases slightly to 381.5, indicating occasional failed experiments and potential overfitting. The value function emphasizes long-term gains over immediate rewards.

Phase 5: Fine-Grained Optimization (Episodes 20,001–25,000)¶

Complex strategies emerge, including ghost herding and route optimization. Improvement slows as performance peaks with average rewards of 377.0 (shaped) and 465.0 (unshaped), and evaluation reward reaching 518.2. PPO clipping constrains updates, promoting refined improvements.

Phase 6: Performance Plateau (Episodes 25,001–30,000)¶

Agent behavior stabilizes, leading to a performance plateau. Average rewards level at 411.2 (shaped) and 495.9 (unshaped), while evaluation rewards slightly decline to 506.1. This indicates convergence to a local optimum, implying further improvements may need alternative exploration strategies or algorithmic changes.

In [2]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

def plot_results(results):
    line_color_avg = '#3B596A'
    line_color_raw = '#A3D1C8'
    grid_color = '#D3D3D3'
    text_color = '#4A4A4A'
    accent_color = '#3B596A'
    rewards_list = list(results)
    # 100-episode moving average; empty until at least 100 episodes exist
    moving_average = ([np.mean(rewards_list[j:j + 100]) for j in range(len(rewards_list) - 99)]
                      if len(rewards_list) >= 100 else [])

    fig, ax = plt.subplots(figsize=(14, 8))
    ax.plot(rewards_list, alpha=0.4, color=line_color_raw, label='Episode Reward')
    if moving_average:
        ax.plot(range(99, len(rewards_list)), moving_average, color=line_color_avg,
                linestyle='-', alpha=0.8, lw=2.5, label='100-Episode Moving Average')

    # Annotate the six training phases described in Section 5.1
    phase_intervals = {
        (1, 5000): "Random \nExploration",
        (5001, 10000): "Rapid Policy\n Acquisition",
        (10001, 15000): "Strategic \nConsolidation",
        (15001, 20000): "Advanced \nAdaptation",
        (20001, 25000): "Fine-Grained \nOptimization",
        (25001, 30000): "Performance \nPlateau"
    }
    for (start, end), phase_name in phase_intervals.items():
        midpoint = (start + end) / 2
        ax.text(midpoint, 680, phase_name, color=accent_color, fontsize=14,
                fontweight='bold', ha='center')
        ax.axvline(x=end, color=accent_color, linestyle='--', lw=1.5, alpha=0.8)

    ax.set_title("Training: pacman_ppo_model | Random Seed: 42",
                 fontsize=18, fontweight='bold', pad=15, color=text_color)
    ax.set_xlabel("Episode", fontsize=16, color=text_color)
    ax.set_ylabel("Reward", fontsize=16, color=text_color)
    ax.set_ylim(-100, 1020)
    ax.set_yticks(np.arange(-100, 1001, 100))
    ax.grid(True, axis='x', color=grid_color, linestyle='-', lw=1, alpha=0.6)

    for spine in ['top', 'right']:
        ax.spines[spine].set_visible(False)
    for spine in ['left', 'bottom']:
        ax.spines[spine].set_color(grid_color)

    ax.tick_params(axis='x', colors=text_color, labelsize=14)
    ax.tick_params(axis='y', colors=text_color, labelsize=14)

    legend = ax.legend(fontsize=14, loc='upper left')
    legend.get_frame().set_edgecolor(grid_color)
    for text in legend.get_texts():
        text.set_color(text_color)

    # plt.savefig("pacman_ppo_model_training_plot.png")
    plt.tight_layout()
    plt.show()

df = pd.read_csv('pacman_ppo_model_reward_ep30000.csv', index_col=0)
plot_results(df['reward'])
[Figure 2: Episode rewards and 100-episode moving average over 30,000 episodes, annotated with the six training phases]

5.2 Final Model Evaluation¶

The final PPO policy was evaluated over 100 episodes to assess generalization. Unshaped training rewards reached 495.9 by episode 30,000, and evaluation rewards averaged 506.12, indicating stable performance; the highest recorded evaluation score was 812 points, as shown in Figure 3. The agent typically bypasses the remaining low-value wafers, each worth only one point, and instead takes risks to consume the vitamin, worth 100 points. This behavior suggests that appropriate reward shaping could better guide the agent toward clearing the maze when that is the desired objective.

Despite training rewards increasing between episodes 25,000 and 30,000, evaluation rewards slightly declined from 518.19 to 506.12, suggesting potential overfitting and limited generalization. Qualitative analysis confirms effective power pellet use, ghost avoidance, and pellet collection, though occasional failures in complex scenarios highlight areas for improvement. Overall, results demonstrate strong task proficiency and validate PPO's suitability for medium-complexity environments while underscoring challenges in sustained generalization.

In [4]:
from IPython.display import display, Image
display(Image("/content/Figure3.jpg", width=600, height=400))
[Figure 3: Best evaluation episode, scoring 812 points]
In [ ]:
# !pip install gymnasium[atari]
# !pip install ale-py
In [ ]:
import os
import ale_py
import gymnasium as gym
import matplotlib.animation as animation
import matplotlib.pyplot as plt
import numpy as np
import torch
import torch.nn as nn
from IPython.display import HTML
from torch.distributions import Categorical
In [ ]:
class PolicyNet(nn.Module):
    """The policy network (Actor) for selecting actions."""
    def __init__(self, state_dim, action_dim, hidden_dim=256):
        super(PolicyNet, self).__init__()
        self.network = nn.Sequential(
            nn.Linear(state_dim, hidden_dim), nn.ReLU(),
            nn.Linear(hidden_dim, hidden_dim), nn.ReLU(),
            nn.Linear(hidden_dim, hidden_dim), nn.ReLU(),
            nn.Linear(hidden_dim, action_dim)
        )

    def forward(self, state):
        """Returns logits for each action."""
        return self.network(state)

class EvaluationAgent:
    """Agent to load and evaluate a trained PPO model."""
    def __init__(self, state_dim, action_dim, model_path, hidden_dim=256):
        self.device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
        self.actor = PolicyNet(state_dim, action_dim, hidden_dim).to(self.device)
        checkpoint = torch.load(model_path, map_location=self.device)
        self.actor.load_state_dict(checkpoint['actor_state_dict'])
        self.actor.eval()
        print(f"Model loaded successfully from {model_path}")

    def select_action(self, state):
        """Selects the best action deterministically."""
        with torch.no_grad():
            state_tensor = torch.FloatTensor(state / 255.0).to(self.device)  # normalize RAM bytes to [0, 1]
            logits = self.actor(state_tensor)
            action = torch.argmax(logits).item()
        return action
In [ ]:
def set_seed(seed, env=None):
    """Sets random seeds for reproducibility."""
    np.random.seed(seed)
    torch.manual_seed(seed)
    if torch.cuda.is_available():
        torch.cuda.manual_seed_all(seed)
    if env is not None:
        env.reset(seed=seed)
        env.action_space.seed(seed)
        env.observation_space.seed(seed)
In [ ]:
def evaluate_pacman_model(model_path, seed, num_episodes=100, hidden_dim=256):
    """
    Evaluates a PPO agent on the Pac-Man environment.
    """

    gym.register_envs(ale_py)
    env = gym.make("ALE/Pacman-v5", obs_type='ram', render_mode='rgb_array', mode=4)
    set_seed(seed, env)
    state_dim = env.observation_space.shape[0]
    action_dim = env.action_space.n
    agent = EvaluationAgent(state_dim, action_dim, model_path, hidden_dim)

    episode_rewards = []
    best_reward = -float('inf')
    best_frames = []

    print(f"\nRunning evaluation for {num_episodes} episodes with seed {seed}...")
    for episode in range(num_episodes):
        state, _ = env.reset(seed=seed + episode)
        done, total_reward = False, 0
        frames = []

        while not done:
            frames.append(env.render())
            action = agent.select_action(state)
            state, reward, terminated, truncated, _ = env.step(action)
            done = terminated or truncated
            total_reward += reward

        episode_rewards.append(total_reward)
        if total_reward > best_reward:
            best_reward = total_reward
            best_frames = frames

        if (episode + 1) % 10 == 0:
            print(f"--> Episode {episode + 1}/{num_episodes} | Average reward (last 10): {np.mean(episode_rewards[-10:]):.2f}")

    env.close()

    avg_score = np.mean(episode_rewards)
    std_dev = np.std(episode_rewards)
    min_reward = np.min(episode_rewards)

    print("\n" + "="*80)
    print("Evaluation Finished")
    print(f"Average Score: {avg_score:.2f} +/- {std_dev:.2f}")
    print(f"Best Reward:   {best_reward:.2f}")
    print(f"Min Reward:    {min_reward:.2f}")
    print("="*80)

    return {
        "average_score": avg_score,
        "std_dev": std_dev,
        "best_reward": best_reward,
        "min_reward": min_reward,
        "best_frames": best_frames,
        "all_rewards": episode_rewards
    }
In [ ]:
seed = 123
model_paths = ['pacman_ppo_model_ep5000.pth', 'pacman_ppo_model_ep10000.pth',
               'pacman_ppo_model_ep15000.pth', 'pacman_ppo_model_ep20000.pth',
               'pacman_ppo_model_ep25000.pth', 'pacman_ppo_model_ep30000.pth']

for model in model_paths:
    results = evaluate_pacman_model(model_path=model, seed=seed, num_episodes=100, hidden_dim=256)
Model loaded successfully from pacman_ppo_model_ep5000.pth

Running evaluation for 100 episodes with seed 123...
--> Episode 10/100 | Average reward (last 10): 23.20
--> Episode 20/100 | Average reward (last 10): 23.00
--> Episode 30/100 | Average reward (last 10): 25.50
--> Episode 40/100 | Average reward (last 10): 22.50
--> Episode 50/100 | Average reward (last 10): 24.30
--> Episode 60/100 | Average reward (last 10): 25.20
--> Episode 70/100 | Average reward (last 10): 21.80
--> Episode 80/100 | Average reward (last 10): 24.10
--> Episode 90/100 | Average reward (last 10): 25.90
--> Episode 100/100 | Average reward (last 10): 23.80

================================================================================
Evaluation Finished
Average Score: 23.93 +/- 5.02
Best Reward:   38.00
Min Reward:    20.00
================================================================================
Model loaded successfully from pacman_ppo_model_ep10000.pth

Running evaluation for 100 episodes with seed 123...
--> Episode 10/100 | Average reward (last 10): 345.80
--> Episode 20/100 | Average reward (last 10): 347.70
--> Episode 30/100 | Average reward (last 10): 343.20
--> Episode 40/100 | Average reward (last 10): 344.80
--> Episode 50/100 | Average reward (last 10): 343.70
--> Episode 60/100 | Average reward (last 10): 340.20
--> Episode 70/100 | Average reward (last 10): 343.00
--> Episode 80/100 | Average reward (last 10): 346.30
--> Episode 90/100 | Average reward (last 10): 347.10
--> Episode 100/100 | Average reward (last 10): 346.40

================================================================================
Evaluation Finished
Average Score: 344.82 +/- 8.89
Best Reward:   386.00
Min Reward:    335.00
================================================================================
Model loaded successfully from pacman_ppo_model_ep15000.pth

Running evaluation for 100 episodes with seed 123...
--> Episode 10/100 | Average reward (last 10): 380.20
--> Episode 20/100 | Average reward (last 10): 418.30
--> Episode 30/100 | Average reward (last 10): 415.90
--> Episode 40/100 | Average reward (last 10): 391.70
--> Episode 50/100 | Average reward (last 10): 402.80
--> Episode 60/100 | Average reward (last 10): 414.50
--> Episode 70/100 | Average reward (last 10): 469.40
--> Episode 80/100 | Average reward (last 10): 413.80
--> Episode 90/100 | Average reward (last 10): 386.80
--> Episode 100/100 | Average reward (last 10): 401.90

================================================================================
Evaluation Finished
Average Score: 409.53 +/- 50.19
Best Reward:   639.00
Min Reward:    273.00
================================================================================
Model loaded successfully from pacman_ppo_model_ep20000.pth

Running evaluation for 100 episodes with seed 123...
--> Episode 10/100 | Average reward (last 10): 400.00
--> Episode 20/100 | Average reward (last 10): 360.30
--> Episode 30/100 | Average reward (last 10): 358.60
--> Episode 40/100 | Average reward (last 10): 404.50
--> Episode 50/100 | Average reward (last 10): 361.20
--> Episode 60/100 | Average reward (last 10): 381.00
--> Episode 70/100 | Average reward (last 10): 375.90
--> Episode 80/100 | Average reward (last 10): 373.70
--> Episode 90/100 | Average reward (last 10): 432.10
--> Episode 100/100 | Average reward (last 10): 368.10

================================================================================
Evaluation Finished
Average Score: 381.54 +/- 68.28
Best Reward:   790.00
Min Reward:    345.00
================================================================================
Model loaded successfully from pacman_ppo_model_ep25000.pth

Running evaluation for 100 episodes with seed 123...
--> Episode 10/100 | Average reward (last 10): 496.30
--> Episode 20/100 | Average reward (last 10): 515.70
--> Episode 30/100 | Average reward (last 10): 509.50
--> Episode 40/100 | Average reward (last 10): 552.70
--> Episode 50/100 | Average reward (last 10): 512.60
--> Episode 60/100 | Average reward (last 10): 546.80
--> Episode 70/100 | Average reward (last 10): 502.20
--> Episode 80/100 | Average reward (last 10): 480.50
--> Episode 90/100 | Average reward (last 10): 561.20
--> Episode 100/100 | Average reward (last 10): 504.40

================================================================================
Evaluation Finished
Average Score: 518.19 +/- 112.59
Best Reward:   802.00
Min Reward:    345.00
================================================================================
Model loaded successfully from pacman_ppo_model_ep30000.pth

Running evaluation for 100 episodes with seed 123...
--> Episode 10/100 | Average reward (last 10): 481.00
--> Episode 20/100 | Average reward (last 10): 567.00
--> Episode 30/100 | Average reward (last 10): 509.80
--> Episode 40/100 | Average reward (last 10): 566.10
--> Episode 50/100 | Average reward (last 10): 487.10
--> Episode 60/100 | Average reward (last 10): 463.50
--> Episode 70/100 | Average reward (last 10): 448.40
--> Episode 80/100 | Average reward (last 10): 469.60
--> Episode 90/100 | Average reward (last 10): 581.50
--> Episode 100/100 | Average reward (last 10): 487.20

================================================================================
Evaluation Finished
Average Score: 506.12 +/- 132.37
Best Reward:   812.00
Min Reward:    344.00
================================================================================
In [ ]:
fig, ax = plt.subplots(figsize=(8, 6))
ax.axis('off')
im = ax.imshow(results['best_frames'][0])

def animate(i):
    im.set_array(results['best_frames'][i])
    return [im]

anim = animation.FuncAnimation(fig, animate, frames=len(results['best_frames']), interval=50)
# Save the animation as an mp4 file (requires ffmpeg):
# anim.save(f"pacman_evaluation_seed{seed}_{results['best_reward']}.mp4", writer='ffmpeg', fps=20)
display(HTML(anim.to_jshtml()))
plt.close(fig)
[Animation: replay of the best evaluation episode]

6. Discussion and Conclusion¶

6.1 Main Takeaways¶

This project presents a comprehensive evaluation of a PPO agent trained with RAM-based state inputs in the Atari 2600 Pac-Man environment. Over 30,000 episodes on a Colab L4 GPU, the agent progressively developed strategic behaviors within a symbolic, non-spatial state space. The results provide valuable insights into the strengths and limitations of PPO in this gameplay environment.

Training Dynamics and Reward Shaping
The phased learning progression underscores the critical role of reward shaping. Early-stage high-entropy exploration facilitated broad coverage of the state space, while shaped rewards effectively guided the agent toward efficient pellet collection and ghost avoidance. However, occasional declines in evaluation rewards despite improved training performance indicate potential overfitting to the shaped reward signals.

Exploration versus Exploitation
Effective training required a careful balance between exploration and exploitation. The clipped policy updates in PPO helped moderate learning and contributed to the stabilization of complex, multi-step strategies including ghost herding and power pellet utilization. Additionally, the observed plateau in performance toward the end of training indicates that further gains may depend on the incorporation of novel exploration techniques or modifications to the environment.

Generalization and Performance Stability
Evaluation across 100 episodes demonstrated stable policy performance with maximum scores exceeding 800 points. Slight decreases in evaluation scores near the end of training suggest that the learned policy may not fully generalize beyond the training environment. This highlights the potential benefit of more robust training techniques, including environment design and refined reward shaping.

6.2 Future Work and Conclusion¶

Algorithmic and Architectural Improvements
Exploring alternative reinforcement learning methods such as A2C or SAC, and combining RAM inputs with visual observations, could improve learning stability and better capture spatial information critical for decision making.

Exploration and Reward Design
Incorporating intrinsic motivation or curiosity-driven exploration alongside refined reward structures, including bonuses for strategic actions, may enhance discovery of important game dynamics and accelerate skill acquisition.
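A simple count-based flavor of the intrinsic motivation suggested here can be sketched over hashed RAM states. The `beta` scale and the 1/sqrt(n) decay are illustrative choices, not a tested configuration:

```python
import numpy as np
from collections import defaultdict

def count_based_bonus(ram_state, counts, beta=0.1):
    """Count-based intrinsic reward over hashed RAM states.

    Rarely visited RAM states earn a larger intrinsic reward, which
    decays as 1/sqrt(n) with the visit count n; beta scales the bonus.
    """
    key = hash(bytes(ram_state))
    counts[key] += 1
    return beta / np.sqrt(counts[key])

visit_counts = defaultdict(int)
# The bonus shrinks on each revisit of the same RAM state.
first_visit = count_based_bonus([1, 2, 3], visit_counts)
revisit = count_based_bonus([1, 2, 3], visit_counts)
```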

Robust Evaluation and Generalization
Future research should utilize multiple random seeds, varied game layouts, and difficulty settings to strengthen performance reliability and assess policy generalization across diverse scenarios.
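A multi-seed protocol could reuse `evaluate_pacman_model` (which already returns `all_rewards`) once per seed and then summarize across seeds; a minimal sketch of the aggregation step:

```python
import numpy as np

def aggregate_across_seeds(rewards_by_seed):
    """Cross-seed summary for a multi-seed evaluation.

    rewards_by_seed maps a seed to that run's list of episode rewards;
    the mean and standard deviation of the per-seed averages give a
    more robust estimate than a single-seed run.
    """
    per_seed_means = {s: float(np.mean(r)) for s, r in rewards_by_seed.items()}
    means = np.array(list(per_seed_means.values()))
    return {"mean": float(means.mean()),
            "std": float(means.std(ddof=1)),
            "per_seed": per_seed_means}
```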

References¶

  • Bharathan Balaji, Sunil Mallya, Sahika Genc, Saurabh Gupta, Leo Dirac, Vineet Khare, Gourav Roy, Tao Sun, Yunzhe Tao, Brian Townsend, Eddie Calleja, Sunil Muralidhara, and Dhanasekar Karuppasamy. Deepracer: Educational autonomous racing platform for experimentation with sim2real reinforcement learning. arXiv preprint arXiv:1911.01562, 2019. URL: https://arxiv.org/abs/1911.01562.

  • Marc G. Bellemare, Yavar Naddaf, Joel Veness, and Michael Bowling. The arcade learning environment: An evaluation platform for general agents. CoRR, abs/1207.4708, 2012. URL: https://arxiv.org/abs/1207.4708.

  • Emma Brunskill, Chethan Bhateja, Aishwarya Mandyam, HyunJi (Alex) Nam, Hengyuan Hu, Lansong (Ryan) Li, Shiyu Zhao, and Keenon Werling. CS234: Reinforcement learning, 2025. URL: https://web.stanford.edu/class/cs234/. Accessed: 2025-06-18.

  • Micah Carroll, Rohin Shah, Mark K. Ho, Thomas L. Griffiths, Sanjit A. Seshia, Pieter Abbeel, and Anca D. Dragan. On the utility of learning about humans for human-AI coordination. arXiv preprint arXiv:1910.05789, 2019. URL: https://arxiv.org/abs/1910.05789.

  • Logan Engstrom, Andrew Ilyas, Shibani Santurkar, Dimitris Tsipras, Firdaus Janoos, Larry Rudolph, and Aleksander Madry. Implementation matters in deep policy gradients: A case study on PPO and TRPO. arXiv preprint arXiv:2005.12729, 2020. URL: https://arxiv.org/abs/2005.12729.

  • Farama Foundation. Pacman environment — arcade learning environment. https://ale.farama.org/environments/pacman/, 2025. Accessed: 2025-08-01.

  • Farama Foundation. Lunar lander - gymnasium documentation, 2025. URL: https://gymnasium.farama.org/environments/box2d/lunar_lander/. Accessed: 2025-06-18.

  • OpenAI. Proximal policy optimization, 2017. URL: https://spinningup.openai.com/en/latest/algorithms/ppo.html. Accessed: 2025-06-18.

  • Long Ouyang, Jeff Wu, Xu Jiang, Diogo Almeida, Carroll L. Wainwright, Pamela Mishkin, Chong Zhang, Sandhini Agarwal, Katarina Slama, Alex Ray, John Schulman, Jacob Hilton, Fraser Kelton, Luke Miller, Maddie Simens, Amanda Askell, Peter Welinder, Paul Christiano, Jan Leike, and Ryan Lowe. Training language models to follow instructions with human feedback, 2022. URL: https://arxiv.org/abs/2203.02155.

  • John Schulman, Philipp Moritz, Sergey Levine, Michael I. Jordan, and Pieter Abbeel. High-dimensional continuous control using generalized advantage estimation. arXiv preprint arXiv:1506.02438, 2015. URL: https://arxiv.org/abs/1506.02438.

  • John Schulman, Filip Wolski, Prafulla Dhariwal, Alec Radford, and Oleg Klimov. Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347, 2017. URL: https://arxiv.org/abs/1707.06347.

  • Unsloth. Reinforcement learning guide, 2025. URL: https://docs.unsloth.ai/basics/reinforcement-learning-guide. Accessed: 2025-06-18.

  • Chao Yu, Akash Velu, Eugene Vinitsky, Jiaxuan Gao, Yu Wang, Alexandre Bayen, and Yi Wu. The surprising effectiveness of PPO in cooperative, multi-agent games. arXiv preprint arXiv:2103.01955, 2021. URL: https://arxiv.org/abs/2103.01955.